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->  Abstract 

We  consider  optimizations  that  are  required  for  efficient  execution  of  code 
segments  that  consists  of  loops  over  distributed  data  structures.  The  PARTI 
execution  time  primitives  are  designed  to  carry  out  these  optimizations  and' 
can  be  used  to  implement  a  wide  range  of  scientific  algorithms  on  distributed 
memory  machines. 

These  primitives  allow  the  user  to  control  array  mappings  in  a  way  that 
gives  an  appearance  of  shared  memory.  Computations  can  be  based  on  a  global 
index  set.  Primitives  are  used  to  carry  out  gather  and  scatter  operations  on 
distributed  arrays.  Communications  patterns  are  derived  at  runtime,  and  the 
appropriate  send  and  receive  messages  are  automatically  generated. 
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1  Introduction 

Efficient  implementation  of  scientific  codes  on  distributed  memory  architectures  re¬ 
quires  special  techniques  both  at  run-time  and  at  compile-time.  The  PARTI  (Parallel 
Automated  Runtime  Toolkit  at  ICASE)  system  is  a  set  of  primitives  that  can  be  used 
to  implement  a  wide  range  of  scientific  algorithms  on  distributed  memory  machines. 
These  primitives  support  various  run-time  operations  required  by  programs  that 
make  use  of  an  embedded  shared  name  space  on  a  distributed  machine.  The  user 
can  carry  out  gather  and  scatter  operations  on  distributed  arrays  using  a  global  in¬ 
dex  set.  Communications  patterns  are  derived  at  runtime,  and  the  appropriate  send 
and  receive  messages  are  automatically  generated. 

The  PARTI  primitives  are  initialized  by  specifying  a  mapping  into  distributed 
memory  for  each  globally  defined  multidimensional  array.  The  primitives  include 
procedures  that  allow  one  to  scatter  and  gather  array  elements  in  the  distributed 
memory.  The  PARTI  tools  are  organized  into  two  levels.  The  lower  level  supports 
memory  operations  such  as  scatters  and  gathers  across  processors.  The  higher  level 
binds  mapping  information  to  distributed  arrays  and  uses  this  information  to  call  the 
lower  level  primitives.  The  primitives  allow  the  storage  of  information  about  memory 
access  patterns  so  that  memory  operations  with  the  same  address  calculations  need 
not  repeat  these  calculations.  Send/receive  schedules  are  generated  for  memory 
operations.  The  schedules  may  be  stored  and  reused  as  well.  This  is  particularly 
important  for  the  implementation  of  iterative  algorithms. 

The  ideas  incorporated  in  PARTI  are  specifically  aimed  at  computations  on  dis¬ 
tributed  memory  machines  in  which  the  structure  of  the  computation  depends  on 
the  input  data.  Run-time  support  must  be  incorporated  as  part  of  the  distributed 
implementation  of  such  computations.  Directly  incorporating  the  primitives  into 
applications  programs  allow  investigation  of  the  usefulness  and  relevance  of  various 
optimizations.  In  this  paper  we  demonstrate  and  benchmark  these  primitives  using 
two  different  programs.  The  first  is  an  adaptive  method  for  solving  partial  differential 
equations,  the  second  is  a  kernel  from  an  unstructured  mesh  code. 

1.1  Related  Research 

Williams  [16]  describes  a  programming  environment  for  calculations  with  unstruc¬ 
tured  triangular  meshes  using  distributed  memory  machines.  In  [16],  collections 
of  distributed  array  accesses  are  translated  into  an  efficient  set  of  inter-node  mes¬ 
sages.  Similar  mechanisms  for  translating  an  irregular  pattern  of  array  accesses  into 
inter-node  messages  have  been  proposed  in  order  to  make  it  possible  to  efficiently 
distribute  loops  where  some  array  references  are  made  through  a  level  of  indirection. 
Work  on  this  topic  was  presented  by  the  present  authors  in  [12], [8],  [9]  as  well  as  by 
Mehrotra  and  Van  Rosendale  [7,  6]. 

In  this  paper,  we  present  primitives  that  can  be  used  directly  by  programmers 
that  allow  users  to  carry  out  gather  and  scatter  operations  over  global  index  sets  on 
distributed  arrays. 

Callahan  and  Kennedy  [3],  Rogers  and  Pingali  [10],  and  Rosing  and  Schnabel  [11] 
suggest  execution  time  resolution  of  communications  on  distributed  machi;  -  -  Xone 
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of  these  utilize  information  on  repeated  patterns  of  communications.  The  Linda 
system[l]  provides  an  associative  addressing  scheme  by  which  a  reference  to  variables 
can  be  resolved  at  execution  time.  This  in  essence  provides  a  shared  name  space  for 
distributed  memory  machines;  however,  the  shared  name  space  does  not  allow  users 
to  determine  how  data  is  to  be  partitioned  between  processors.  The  costs  associated 
with  this  lack  of  data  locality  can  be  extremely  high  in  some  cases  [12]. 


2  PARTI  Primitives 

The  PARTI  primitives  currently  consist  of  two  levels.  The  most  fundamental  of 
the  primitives  are  the  Level  0  Primitives.  They  consist  of  routines  to  gather  and/or 
scatter  (read  and  write)  values  to  elements  of  one  dimensional  arrays  aloc^  defined  on 
each  processor  j.  Each  aloe7  is  local  to  processor  j\  it  is  not  viewed  as  a  distributed 
array  by  the  Level  0  Primitives. 

2.1  Level  0  Gather  and  Scatter 

Level  0  gathers  and  scatters  are  accomplished  by  using  three  routines:  Scheduler, 
Gather  Exchanger ,  and  Scatter  Exchanger. 

Scheduler  on  processor  P1  is  passed  a  list  of  indices  Kj  into  each  aloe 3  from  which 
data  is  to  be  fetched  and  produces  a  schedule  S  that  is  used  by  both  exchangers. 

On  processor  P‘,  Gather  Exchanger  inputs 

1.  a  buffer  into  which  the  fetched  elements  are  to  be  placed 

2.  the  location  of  array  aloe * 

3.  the  schedule  S  produced  by  Scheduler 

Gather  Exchanger  executes  sends  and  receives  that  fetch  from  each  processor  P3  the 
appropriate  elements  from  the  array  aloe* .  Then  it  places  these  elements  into  the 
user-supplied  buffer. 

Scatter  Exchanger  is  passed 

1.  a  buffer  from  which  each  scattered  datum  is  to  be  obtained 

2.  the  location  of  array  aloe ‘ 

3.  the  schedule  S  produced  by  Scheduler 

Scatter  Exchanger  executes  sends  and  receives  that  put  on  each  processor  P3  the 
appropriate  elements  from  the  buffer.  Then  Scatter  Exchanger  places  these  elements 
into  the  appropriate  elements  of  array  aloe3. 
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2.1.1  Functioning  of  the  Scheduler  and  Exchangers 

Exchange  procedures  for  both  the  scatter  and  the  gather  have  three  stages.  They 
permute  data  into  buffers  to  be  sent.  They  carry  out  the  needed  communication, 
then  they  perform  another  permutation. 

The  scheduler  first  determines  how  many  messages  each  processor  must  send 
and  receive  during  the  data  exchange  phase.  Defined  on  each  processor  P1  is  an 
array  nmsgs1.  Each  processor  sets  its  value  of  nmsgs1(j)  to  1  if  it  needs  data  from 
processor  j  or  to  0  if  it  does  not.  The  scheduler  then  replaces  nmsgs  with  the  element- 
by-element  sum  nmsgs1(j)  *—  Y,k  nmsgsk(j).  This  operation  utilizes  a  function  that 
imposes  a  fan-in  tree  to  find  the  sums.  Since  the  resulting  sum  is  kept  in  nmsgs1, 
at  the  end  of  the  fan-in  on  every  processor,  nmsgsi(j)  is  the  number  of  messages 
that  processor  P1  must  send  during  the  exchange  phase.  Next,  each  processor  sends 
a  request  list  to  every  other  processor.  The  request  list  sent  from  processor  Pp  to 
processor  Pq  contains  the  indices  of  data  needed  by  processor  Pp  that  are  stored  on 
processor  Pq. 

The  number  of  non-empty  request  lists  each  processor  will  receive  is  equal  to  the 
number  of  messages  that  the  processor  will  send  in  the  exchange  phase.  Each  request 
list  is  placed  in  an  array  indexed  by  the  processor  from  which  the  list  came.  When 
the  scheduler  is  finished,  each  processor  has  an  array  of  request  lists  obtained  from 
other  processors.  The  jth  element  of  this  array  contains  the  request  list  obtained  from 
processor  j.  At  this  point  in  the  execution,  each  processor  P1  knows  which  elements 
of  aloe1  must  be  sent  to  other  processors.  This  information  is  used  to  generate  the 
schedule  S  of  pairs  of  send  and  receive  statements.  These  send/receive  pairs  will 
exchange  the  requested  data  for  either  a  gather  or  a  scatter.  Thus,  both  the  gather 
and  the  scatter  call  the  exchanger  routine.  The  exchanger  is  passed  the  schedule  S 
with  the  required  buffer  space.  It  then  carries  out  the  required  communication. 

2.1.2  Additional  Exchangers 

In  addition  to  the  Level  0  exchanger,  we  have  found  it  useful  to  develop  hybrids  of 
the  gather  and  scatter  that  perform  remote  operations  on  distributed  array  data. 
For  example,  the  scatter_add  adds  data  elements  Di, ...,  Dn^.  to  elements  aloc^kj)  ,.., 
alocJ(knj).  Similar  exchanges  perform  distributed  subtractions,  multiplications  and 
divisions. 

2.2  Level  1  Primitives 

The  Level  1  Primitives  are  the  user  interface  between  an  application  code  and  the 
Level  0  Primitives.  Use  of  Level  1  Primitives  allows  the  dynamic  allocation  of  dis¬ 
tributed  multi-dimensional  arrays  and  supports  data  transfer  between  these  arrays. 
The  Level  1  Primitives  consist  of  declaration  procedures  and  of  communication  pro¬ 
cedures.  The  intermediate  level  PARTI  declaration  procedures  allow  the  user  to 
declare  a  dynamically  allocated  distributed  array  in  a  way  that  allows  specification 
of  how  the  array  is  to  be  partitioned  between  processors.  Coupled  to  these  distributed 
array  declarations  are  the  intermediate  level  gather  and  scatter  procedures.  These 
procedures  are  designed  to  allow  users  to  fetch  or  store  array  elements  from  the 
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distributed  memory  in  a  way  that  does  not  require  the  user  to  keep  track  of  where 
array  elements  are  stored.  This  makes  it  relatively  straightforward  to  write  codes 
that  allow  data  structures  to  be  repartitioned  during  program  execution.  All  mem¬ 
ory  is  allocated  or  declared  in  the  programs  that  call  PARTI  procedures;  memory 
locations  are  passed  to  the  procedures  that  perform  array  initializations.  Users  write 
programs  that  contain  a  combination  of 

1.  code  written  to  execute  on  individual  processors 

2.  communications  calls  that  consist  of  gathers  or  scatters  to  distributed  arrays 

3.  communication  calls  that  consist  of  zero-level  gathers  or  scatters 

4.  send  and  receive  message  passing  calls 

2.2.1  Level  1  Declaration  Procedures 

The  declaration  procedures  in  the  PARTI  primitives  allow  the  user  to  describe  how 
a  data  array  is  mapped  into  the  distributed  memory  of  the  machine.  This  is  ac¬ 
complished  by  specifying  the  mapping  of  the  data  to  a  virtual  processor  array,  then 
describing  the  relationship  between  the  virtual  processor  array  and  the  original  pro¬ 
cessor  array. 

Specified  in  an  array  initialization  is  a  processor  grid  G  of  arbitrary  dimension  and 
size.  This  processor  grid  is  automatically  gray  coded  in  the  current  implementation. 
Embedded  in  G  is  an  array  of  virtual  processors  V.  The  embedding  is  specified  by 
the  user.  This  two-stage  specification  of  processor  sets  allows  embedding  different 
distributed  arrays  into  different  subarrays  V  of  G.  Note  also  that  different  distributed 
arrays  can  be  initialized  with  different  processor  grids  G  or  with  the  same  G  but  with 
different  virtual  processor  subarrays  V. 

Data  arrays  are  mapped  onto  virtual  processor  subarrays  in  any  one  of  a  num¬ 
ber  of  ways.  We  support  tensor  product  mappings  in  which  each  array  dimension 
is  partitioned  independently  or  is  left  undistributed.  Following  [7],  we  support 
blocked  partitionings  where  each  processor  receives  an  equal  number  of  contiguous 
array  elements  along  a  given  dimension,  and  cyclic  distributions  in  which  elements 
are  distributed  stripped  fashion  across  the  processors.  We  also  support  enumerated 
distributions  in  which  mappings  are  supported  by  distributed  translation  tables.  The 
number  of  dimensions  of  an  array  to  be  distributed  must  match  the  dimensionality 
of  G. 

2.2.2  Level  1  Communication  Procedures 

The  Level  1  Gather  exchanger  and  Scatter  exchanger  routines  allow  communi¬ 
cation  of  user  data  based  on  the  global  index  set.  The  Level  1  Scatter  Exchanger 
inputs  lists  of  distributed  array  indices  and  values.  It  places  the  values  in  the  dis¬ 
tributed  memory  locations  specified  by  the  indices  (and  the  initially  supplied  array 
mapping).  The  intermediate  level  Gather  Exchanger  inputs  lists  of  distributed  array 
indices  along  with  a  pointer  to  a  memory  buffer  in  the  calling  processor.  Data  values 
from  the  appropriate  distributed  memory  locations  are  obtained  and  placed  in  the 
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calling  processor’s  bulTer.  An  initialization  or  Scheduler  procedure  call  is  required 
for  Gather  exchanger  or  Scatter  exchanger.  The  initialization  procedure  precom¬ 
putes  the  locations  of  the  data  that  will  to  be  sent  and  received  by  each  processor. 
This  initialization  is  needed  only  once-it  may  be  reused  any  number  of  times.  The 
initialization  will  be  described  first. 

The  initialization  combines  the  mapping  information  provided  by  the  declaration 
routines  with  a  specific  list  of  global  indices.  The  result  is  a  set  of  communication 
calls  coupled  with  some  pre-  or  post-movement  of  the  data  (for  the  gather  and  scatter, 
respectively).  The  communication  plus  movement  is  executed  when  the  gather  or 
scatter  is  called. 

Each  array  A  distributed  with  the  Level  1  Primitives  is  treated  by  the  Level  0 
routines  as  a  set  of  aloe  arrays  defined  on  each  processor.  Elements  of  A  c:e  gathered 
by  calculating  corresponding  indices  of  aloe  arrays  and  then  executing  a  Level  0 
gather. 

The  index  of  a  D-dimensional  array  is  translated  to  a  physical  processor  p  and 
an  index  i  to  aloe.  Index  i  is  generated  from  the  user-specified  index  by  using 
the  mapping  from  the  data  to  the  virtual  processor  array.  The  physical  processor 
P  is  determined  by  locating  the  virtual  processor  in  V,  then  using  the  embedding 
information  to  obtain  the  position  in  G.  The  physical  processor  corresponding  to  the 
calculated  position  in  G  can  be  obtained  from  the  gray  code.  Notice  that  the  phys¬ 
ical  memory  locations  of  the  various  aloes  are  not  used  directly  in  the  calculations. 
Rather,  the  index  i  is  passed  to  processor  P.  This  has  the  advantage  of  allowing  each 
processor  to  store  its  data  in  different  physical  memory  locations. 

The  steps  involved  in  performing  a  gather  on  an  index  list  L  into  A  are  outlined 
here.  Translations  described  above  produce  a  list  of  indices  Lt  and  a  list  of  processors 
Lp  that  correspond  to  L.  These  lists  are  used  to  determine  the  scheduling  of  a  low 
level  gather  and  are  sorted  by  processor.  The  sort  results  in  a  permutation  array 
Lperm  that  reorders  the  elements  of  L.  Since  moving  and/or  copying  of  data  is 
avoided,  the  efficiency  of  the  gather  is  increased.  A  Level  1  gather  is  performed  from 
executing  the  scheduled  zero-level  gather  and  then  using  Lperm  to  reorder  the  values 
obtained  by  each  processor.  In  many  scientific  programs  it  is  also  useful  to  execute 
isomorphic  gathers  on  different  arrays.  Recall  that  actual  memory  addresses  are  only 
bound  to  the  intermediate  and  the  zero-level  primitives  after  all  of  the  optimizations 
are  carried  out.  Consequently,  the  same  zero-level  gather  schedule  and  the  same 
permutation  array  Lperm  can  be  associated  with  a  number  of  different  distributed 
arrays.  Setup  costs  are  amortized  when  the  same  pattern  of  communication  is  carried 
out  many  times. 

The  Level  1  Gather  Exchanger  and  Scatter  Exchanger  procedures  are  designed 
to  allow  users  to  fetch  or  store  array  elements  from  the  distributed  memory  in  a  way 
that  does  not  require  the  user  to  keep  track  of  where  array  elements  are  stored.  It 
is  relatively  straightforward  to  write  codes  that  allow  data  structures  to  be  reparti¬ 
tioned  during  program  execution.  Declaration  procedures  can  be  called  to  repartition 
the  distributed  data,  and  the  application  can  be  written  in  a  partitioning-independent 
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Figure  1:  Two- mesh  refinement. 


3  The  Use  of  PARTI  Primitives  in  Adaptive  and 
Unstructured  Applications 

We  present  one  complete  code  and  one  computational  kernel  to  motivate  the  dis¬ 
cussion  of  our  execution  time  optimizations  and  the  PARTI  (Parallel  Automated 
Runtime  Toolkit  at  ICASE)  primitives. 

3.1  Adaptive  Mesh  Partial  Differential  Equation  Solver 

The  first  example  is  a  technique  for  adaptive  refinement.  The  method  will  be  de¬ 
scribed  in  the  context  of  the  solution  to 

ut  +  f{u)  s  +  g{u)y  —  eAu  =  0 

in  the  presence  of  a  shock  where  the  profile  (detailed  shape)  of  the  shock  is  desired. 
For  the  method  discussed  here,  resolution  of  the  profile  implies  that  a  highly  refined 
grid  must  be  used  in  a  neighborhood  of  the  shock.  The  theory  behind  the  algorithm 
has  been  described  [4]  so  only  an  algorithmic  description  of  the  method  will  be 
presented  here.  In  this  algorithm,  the  structure  of  the  computations  changes  with 
time  and  a  non-uniform  communication  pattern  arises  due  to  the  sharing  of  data 
between  grids. 

The  method  initially  computes  the  solution  on  a  coarse  mesh.  An  error  estimator 
is  then  applied  to  determine  the  regions  that  will  be  covered  by  a  refined  mesh.  An 
example  mesh  from  this  two-level  refinement  is  shown  in  Figure  3.1. 

The  solution  is  time-dependent.  Time- marching  on  the  refined  mesh  is  performed 
by  taking  many  (e.g.  100)  time  steps  on  the  refined  mesh  for  a  single  coarse-grid 
time  step.  Let  U  represent  the  solution,  k  be  the  temporal  step-counter,  and  (z,;) 
represent  the  discrete  location  on  a  spatial  grid.  The  subscript  r  and  r  are  used 
to  refer  to  the  coarse  and  refined  meshes,  respectively.  The  data  structure  used 
for  the  coarse  mesh  is  a  two-dimensional  array.  The  solution  on  the  refined  mesh  is 
represented  by  a  three-dimensional  data  structure  in  which  the  third  index  rep?' 
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a  block  of  the  refined  mesh  (each  block  corresponds  to  a  single  coarse  grid  square), 
and  first  two  indices  represent  the  spatial  location  within  the  block.  The  general 
structure  of  the  kernel  is  outlined  in  Algorithm  1.  In  general,  a  shock  moves  and 


For  kc  =  1  to  K 

I.  Sweep  over  the  coarse  mesh 

A.  Compute  Uc. 

B.  Flag  region  that  should  be  refined. 

II.  If  flagged  region  is  not  empty. 

A.  Modify  shape  of  refined  region 

B.  Interpolate  boundary  values  for  UT  from  Uc. 

C.  For  kT  =  1  to  Kt 

1.  Sweep  over  the  refined  mesh 

2.  Share  values  between  blocks  in  Ur 

D.  Inject  values  of  refined  region  into  coarse  grid 


Algorithm  1  Two-mesh  algorithm. 

changes  shape.  Thus,  the  refined  mesh  will  be  dynamic  -  its  location,  shape,  and  size 
all  change.  This  means  that  both  the  communication  pattern  within  a  distributed 
mesh  and  the  relationship  of  the  two  meshes  will  change  during  the  execution  of  the 
program. 

Classes  of  inter-processor  communication  needed  to  implement  Algorithm  1  in  a 
distributed  computing  environment  are; 

1.  communication  involved  in  coarse  mesh  sweeps  (Step  I. A.), 

2.  communication  involved  in  fine  mesh  sweeps  (Step  II.C.l.), 

3.  sharing  of  values  between  the  coarse  and  fine  meshes(Steps  II. B.  and  II. D . ) , 
and 

4.  communication  required  to  modify  the  shape  of  the  refined  region  (Step  II. A.). 

In  our  implementation  of  Algorithm  1,  the  PARTI  primitives  are  used  in  all  but  the 
last  set  of  communications. 

3.2  Unstructured  Mesh  Kernel 

Figure  2  depicts  a  schematic  outline  of  a  kernel  from  a  fluid  dynamics  simulation.  In 
Section  4.2  we  will  present  experimental  results  obtained  from  a  similar  but  slightly 
more  complex  kernel.  The  kernel  is  based  on  an  algorithm  which  fills  a  computational 
domain  with  irregular  polygons.  Heuristics  designed  to  ensure  that  the  governing  par¬ 
tial  differential  equation  is  solved  with  an  approximately  equal  accuracy  throughout 
the  computational  domain  determine  the  area  and  shape  of  the  polygons.  The  data 
structures  used  in  solving  the  problem  represent  a  bidirectional  graph  where  vertices 
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For  i=l  to  Number-Edges 
I-  vA  =  yold(node(i,l)) 

VB  =  yold(node(i,2)) 

II.  Calculate  flux  using  vA,  vB. 

This  calculation  also  uses  edgedata(i,l),..,  edgedata(i,SMALL) 

III.  y(node(i,l))  =  y(node(i,l))  +  flux 
y(node(i,2))  =  y(node(i,2))  -  flux 

Figure  2:  Outline  of  Computational  Fluid  Dynamics  Unstructured  Mesh  Kernel 

represent  polygons  and  edges  represent  adjacency  of  the  polygons.  Sweeps  over  the 
polygons  are  accomplished  by  traversing  the  edges  of  this  graph.  In  these  codes  we 
solve  for  equation  values  at  graph  vertices. 

The  kernel  outlined  in  Figure  2  computes  the  flux  across  each  graph  edge.  The 
indices  of  the  two  vertices  connected  by  the  j  th  graph  edge  are  denoted  by  node  ( j  ,  1) 
and  node(  j  ,2)  in  Figure  2.  The  computation  of  the  flux  terms  requires  yold(node(i  ,  1) ) 
and  yold(node(i  ,2))  (step  I  in  Figure  2).  The  computation  also  requires  other  in¬ 
formation  concerning  graph  edge  i  (step  II  in  Figure  2).  In  this  example,  we  assume 
that  data  pertaining  to  edge  i  is  placed  in  row  i  of  edgedata.  In  step  III,  flux  is 
added  to  y(node(i,l))  and  flux  is  subtracted  from  y(node(i,2)) 

It  is  necessary  to  decide  how  the  parallel  loop  iterations  for  index  i  are  to  be 
partitioned  between  processors.  We  must  also  specify  how  the  data  in  arrays  y  and 
yold  should  be  partitioned.  No  matter  how  we  partition  loop  iterations  and  data, 
the  structure  of  the  problem  requires  that  we  access  off-processor  elements  of  y  and 
yold.  On  the  other  hand,  edgedatafi, j)  and  node(i,j)  (j  =  1,2)  are  used  only 
in  the  ith  parallel  loop  iteration  so  these  arrays  can  be  partitioned  so  that  only  local 
array  accesses  are  needed. 

In  this  kernel,  the  dependencies  between  elements  of  arrays  y  and  yold  are  deter¬ 
mined  by  integer  array  node.  We  therefore  cannot  accurately  predict  what  data  must 
be  prefetched  until  the  program  executes.  On  distributed  machines,  it  is  typically 
very  inefficient  to  fetch  individual  off-processor  data  as  a  need  for  these  elements  is 
encountered  because  of  high  communications  latencies.  We  will  use  the  level  0  prim¬ 
itives  to  allow  us  to  preschedule  the  communications  needed  to  efficiently  prefetch 
off-processor  data. 

3.3  The  Preprocessing  Phase 

In  Figure  3,  we  outline  the  preprocessing  needed  to  distribute  the  computation 
depicted  in  Figure  2.  In  this  example,  assume  that  y  and  yold  are  mapped  in  an 
identical  manner.  We  also  assume  that  Local(x)  and  Processor(x)  are  functions  that 
return  the  processor  and  local  index  associated  with  the  distributed  array  element 
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1.  For  all  edges  i  assigned  to  processor  P 

if  Processor(node(i,l))  ^  P 

concatenate  Processor(node(i,l))  to  ProcA 
concatenate  Local(node(i,l))  to  Locals 

if  Processor(node(i,2))  ^  P 

concatenate  Processor(node(i,l))  to  Procs 
concatenate  Local(node(i,l))  to  Locals 

II.  Call  Level  0  Scheduler  with  Locals  and  ProcA  (schedule  SA ) 
Call  Level  0  Scheduler  with  Locals  and  Procs  (schedule  Ss ) 


Figure  3:  Preprocessing  Unstructured  CFD  Kernel 

x.  In  this  example,  we  use  only  the  level  0  primitives. 

In  the  first  step  (step  I  in  the  figure)  we  sweep  through  the  edges  assigned  to 
processor  P  and  generate  lists  of  off-processor  references  into  distributed  arrays  y  and 
yold.  On  processor  P,  the  arrays  ProcA  and  LocalA  are  used  to  store  the  processor 
and  local  array  index  that  corresponds  to  each  off-processor  reference  made  by  P  to 
y(node(i,l))  and  yold(node(i ,  1) ) .  Arrays  Procg  and  Locals  are  used  to  store 
off-processor  references  made  by  P  to  y(node(i,2))  and  yoldfnode  (i  ,  2)  )  .  In  step 
II.  in  Figure  3,  the  level  0  scheduler  is  called  with  the  lists  of  processors  and  local 
indices  associated  with  the  first  vertex  ( ProcA  and  LocalA).  The  level  0  scheduler 
is  called  again  with  the  analogous  lists  associated  with  the  second  vertex.  The  two 
level  0  scheduler  calls  produce  schedules  SA  and  Ss  respectively. 

Once  the  schedules  SA  and  have  been  obtained,  we  can  begin  the  sweep 
over  the  graph  edges.  The  schedules  obtained  above  can  be  reused  as  long  as  the 
assignment  of  edges  to  processors  is  not  altered  and  values  assigned  to  the  inte¬ 
ger  array  node  are  not  changed.  In  step  I  in  Figure  4,  copies  of  data  from  off- 
processor  elements  of  array  yold  are  obtained  and  put  into  arrays  read  —  buffer* 
and  read  —  buffers-  In  step  II  in  this  figure,  we  loop  over  all  edges  assigned 
to  processor  P.  In  step  II A,  when  yold(node(i,l))  is  assigned  to  P,  we  can  assign 
yold(Local(nodv(i,l)))  to  v*.  When  yold(node(i,l))  is  stored  off-processor,  v*  must 
be  obtained  from  the  buffer  read  —  buffer*.  The  analogous  conditional  assignment 
is  carried  out  for  vg  in  step  IIB.  In  step  IIC  the  flux  term  is  computed  using  v*  and  vg 
and  edgedatafi  ,  1)  ,  .  .  edgedata(i  ,  SMALL)  .  In  step  IID  operations  on  elements 
of  y  local  to  processor  P  are  preformed.  Operations  on  non-local  elements  of  y  are  de¬ 
ferred.  For  non-local  elements  of  y  assignments  are  made  to  buffers  write  —  buffer* 
and  to  write  —  buffers-  Finally  in  step  III,  the  calculated  flux  elements  are  added 
and  subtracted  from  the  appropriate  off-processor  array  elements  through  the  use 
of  the  scatter^add  and  scatter_subtract  exchangers.  Note  that  we  can  again  use 
schedules  SA  and  Ss- 
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I.  Call  Level  0  Gather  Exchanger  using  (schedule  Sa ), 
place  data  in  read  —  buffer* 

Call  Level  0  Gather  Exchanger  using  (schedule  Sb), 
piace  data  in  read  —  buffers 

II.  For  all  edges  i  assigned  to  processor  P 

A.  if  Processor(node(i,l))  =  P 

v*  =  yold(Local(node(i,i)) 
otherwise 

vA  =  read  —  buf f er*(Acount) 

Acount  =  Acount  +  1 

B.  if  Processor(node(i,2))  =  P 

vp  =  yold(Local(node(i,2)) 
otherwise 

vb  —  reac.  —  buff  erB(Bcount) 

Bcount  =  Bcount  +  1 

C.  Calculate  flux  using  v*,  vg. 

This  calculation  also  uses  edgedata(i,l),..,  edgedata(i, SMALL) 

D.  if  Processor(node(i,l))  =  P 

y(Local(node(i,l))  =  y(Local(node(i,l))  +  flux 
otherwise 

write  —  buf  f  er*(Lcount)  =  flux 
if  Processor(node(i,2))  =  P 

y(Local(node(i,2))  =  y(Local(node(i,2))  +  flux 
otherwise 

write  —  bufferB(Bcount)  =  flux 

III.  scatter-add  exchanger  called  using  schedule  Sa  and  write  —  buffer* 

scatter-subtract  exchanger  called  using  schedule  Sb  and  write  —  buffers 


Figure  4:  Unstructured  Mesh  CFD  Kernel 
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4  Experiments 

The  experiments  described  in  this  paper  used  either  a  32  processor  iPSC/860  machine 
located  at  ICASE  at  NASA  Langley  research  center  or  a  128  processor  iPSC/860  ma¬ 
chine  located  at  Oak  Ridge  National  Laboratories.  Each  processor  had  8  megabytes 
of  memory.  We  used  the  Greenhill  1.8.5  Beta  version  C  compiler  and  the  Greenhill 
1.8.5  Beta  version  4.1  Fortran  compiler  to  generate  code  for  the  80860  processors. 

4.1  Primitives  Benchmark  Timings 

We  first  measure  the  time  required  to  carry  out  level  0  and  level  1  Scheduler,  Gather 
Exchanger  and  Scatter  Exchanger  procedure  calls.  We  use  the  level  1  initialization 
primitive  to  declare  a  128  by  128  element  distributed  array  of  single  precision  num¬ 
bers.  We  allocate  four  processors  configured  in  a  2  by  2  grid  G  and  allocate  an  array 
block  to  each  processor. 

We  use  the  level  1  primitives  to  repeatedly  exchange  information  between  two 
processors  in  the  grid.  We  first  scatter  and  then  gather  lists  of  array  elements.  In 
this  experiment,  we  chose  array  elements  in  n  by  n  sub-blocks  between  the  upper  left 
hand  corner  and  the  lower  left  hand  corner  of  G.  In  performing  this  experiment,  we 
measure  the  time  required  to  carry  out  the  following  procedure  calls: 

1.  Level  0  Scheduler 

2.  Level  0  Scatter 

3.  Level  0  Gather 

4.  Level  1  Scheduler 

5.  Level  1  Scatter 

6.  Level  1  Gather 

7.  iPSC/860  supplied  send  and  receive  pairs  that  exchange  4n2  bytes  of  data 

In  Table  1  we  depict  the  results  of  these  experiments.  We  present  the  time  (in 
milliseconds)  required  to  carry  out  the  requisite  data  exchange  using  send  and  receive 
messages.  We  then  present  the  ratio  between  the  time  taken  by  PARTI  primitive  calls 
and  the  time  taken  by  the  equivalent  send  and  receive  calls.  Table  1  only  presents 
Gather  Exchange  and  Scheduler  calls.  The  Level  0  and  Level  1  Scatter  Exchange  calls 
were  also  timed,  the  results  for  each  Scatter  Exchange  call  were  virtually  identical 
to  that  of  the  corresponding  Gather  Exchange  call. 

We  first  note  that  for  relatively  large  amounts  of  data,  a  Level  0  Gather  Exchange 
takes  a  factor  of  1.2  more  time  than  the  corresponding  send/receive  pair.  A  Level  1 
Gather  Exchange  takes  a  factor  of  1.6  more  time  than  the  corresponding  send/receive 
pair.  Again,  for  relatively  large  amounts  of  data,  it  costs  about  as  much  to  schedule 
a  message  using  the  Level  0  primitives  as  it  takes  to  send  the  message  using  send 
and  receive  messages.  In  contrast,  the  Level  1  Scheduler  is  an  order  of  magnitude 
more  expensive  than  the  corresponding  send/receive  pairs.  This  relatively  high  cost 
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Table 

. :  Overheads  for  Level  0  and  Level  1  Primitives 

Number  of 

Send 

Level  0 

Level  1 

Level  0 

Level  1 

Data 

Re^>e 

Ch  1  uer 

Gather 

Scheduler 

Scheduler 

Elements 

rr 

(ratio) 

(ratio) 

ratio 

100 

' 

1.2 

2.1 

7.0 

400 

1.3 

1.4 

9.2 

900 

1.5 

1.3 

10.7 

1600 

1.6 

1.3 

11.2 

2500 

1.6 

1.1 

11.1 

3600 

-■ 

_  _7: _ 

1.6 

1.0 

11.2 

is  caused  by  the  integer  operations  needed  to  identify  each  reference  to  a  distributed 
two  dimensional  array  with  a  processor  and  local  storage  location. 

4.2  Kernel  from  Unstructured  Mesh  Code 

We  used  an  unstructured  mesh  that  was  generated  to  carry  out  an  aerodynamic 
simulation  involving  a  multielement  airfoil  in  a  landing  configuration  [5].  The  un¬ 
structured  mesh  consists  of  a  highly  non-uniform  scattering  of  mesh  points  joined 
together  by  line  segments  to  form  a  set  of  triangular  elements.  The  algorithm  used  is 
the  Delaunay  Triangulation  algorithm  [13].  Details  of  this  mesh  generation  process 
can  be  found  in  [5].  The  mesh  used  in  this  problem  had  11143  vertices,  and  is  shown 
in  Figure  5.  The  computational  algorithm  we  study  computes  convective  fluxes  using 
a  method  based  on  Roe’s  approximate  Riemann  solver  [14],  [15]. 

The  computational  kernel  used  in  this  experiment  is  very  closely  related  to  the 
kernel  described  in  Figure  2.  In  place  of  each  of  the  arrays  y  and  yold  found  in 
the  kernel  described  in  Section  3.2,  we  have  four  different  arrays  yp,yu,yv,yp  and 
yoldp,  yoldu,  yold7,  yoldp,  respectively.  These  represent  the  density,  pressure,  and 
velocity  components  of  the  solution.  Each  call  to  an  exchanger  in  Figure  4  is  conse¬ 
quently  replaced  by  four  exchanger  calls.  Since  the  data  access  patterns  for  each  of 
these  four  arrays  are  identical,  we  still  need  only  call  the  scheduler  twice,  once  with 
Ptoca  and  Local  a  and  once  with  Procg  and  Locals . 

The  original  program  was  written  in  Fortran,  we  initially  extracted  the  computa¬ 
tional  kernel  and  produced  a  sequential  C  version  of  the  kernel.  The  multiprocessor 
codes  developed  were  also  written  in  C.  When  partitioning  this  kernel,  we  made 
only  modest  efforts  to  reduce  volume  of  needed  communication  and  to  balance  load. 
When  we  employ  P  processors,  we  partition  the  problem  domain  into  P  strips.  All 
the  vertices  Vs  in  the  Sth  strip  are  assigned  to  a  single  processor.  The  strips  are 
chosen  to  evenly  partition  the  sum  of  the  number  of  edges  associated  with  each  Vs- 
In  order  to  allow  us  to  use  the  Level  0  primitives,  we  renumbered  the  mesh  points  so 
that  contiguously  numbered  mesh  points  are  assigned  to  a  processor.  If  both  vertices 
comprising  an  edge  were  assigned  to  a  processor  P,  that  edge  was  also  assigned  to  P. 
If  the  vertices  iq  and  v2  comprising  an  edge  were  assigned  to  two  different  processors 
P  and  Q  respectively,  we  assigned  the  edge  to  processor  Q. 

We  measured  the  time  required  for  an  iPSC/860  multiprocessor  to  compute  the 
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Figure  5:  Unstructured  Mesh  for  Multielement  Airfoil 
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parallelized  computational  kernel  as  well  as  the  time  needed  to  carry  out  the  prepro¬ 
cessing  and  communications  steps  outline  in  Section  3.2. 

In  Table  2  we  first  report  the  computational  rate  in  Mflops  for  the  parallelized 
code  on  2  through  64  processors,  along  with  a  separate  sequential  version  of  the  kernel 
timed  on  a  single  iPSC/860  node.  The  computational  speed  ranged  from  3.2  Mflops 
for  a  single  node  to  87.3  Mflops  on  64  nodes.  We  next  depict  the  total  time  required 
to  generate  lists  of  off-processor  references  and  to  carry  out  the  Level  0  scheduler  calls 
(Steps  I  and  II,  Figure  3).  The  cost  of  finding  the  processor  number  and  the  local 
index  number  associated  with  the  vertices  in  each  edge  took  a  substantial  portion  of 
this  preprocessing  time.  For  two  processors  the  total  preprocessing  was  198  millisec¬ 
onds.  Of  this  total  preprocessing  time,  181  milliseconds  was  required  for  translating 
vertices  to  processors  and  local  indices,  and  only  17  milliseconds  was  needed  both  to 
form  lists  of  off-processor  references  and  to  call  Level  0  schedulers  (refer  to  Table  2). 
When  64  processors  were  used,  total  preprocessing  required  23  milliseconds,  of  which 
16  milliseconds  was  required  for  forming  the  lists  of  off-processor  references  and  for 
calling  the  Level  0  schedulers. 

As  we  mentioned  in  Section  3.2,  the  preprocessing  only  needs  to  be  carried  out 
once  and  can  be  amortized  over  many  unstructured  mesh  flux  calculations.  The  time 
required  to  carry  out  the  flux  calculations  (Figure  4),  is  given  as  total  kernel  time 
in  Table  2.  Note  that  total  kernel  time  does  not  include  the  preprocessing  cost.  The 
time  required  for  preprocessing  ranges  from  21  %  of  the  time  needed  to  compute 
the  computational  kernel  when  2  processors  were  used  to  35  %  of  the  cost  when 
we  used  64  processors.  Finally,  we  report  the  amount  of  time  taken  by  the  Level 
0  exchangers  (Steps  I  and  III  in  Figure  4.  Communication  costs  ranged  from  2% 
to  61%  of  the  total  cost  needed  for  the  kernel  computation  for  2  and  64  processors 
respectively.  The  time  needed  for  communication  in  the  kernel  increases  with  the 
number  of  processors,  the  communication  time  reaches  a  plateau  of  approximately 
40  milliseconds  for  32  and  64  processors.  This  communication  time  could  probably 
be  significantly  reduced  were  we  to  partition  the  domain  in  a  way  that  required  a 
smaller  volume  of  communication.  While  the  Level  1  primitives  embed  data  arrays 
into  processor  arrays  using  a  gray  code,  the  Level  0  primitives  do  not  provide  support 
for  such  an  embedding.  The  increase  in  communication  time  from  2  to  32  processors 
is  largely  due  to  the  non-local  communication  pattern  [2];  we  would  expect  that 
gray  coding  would  improve  performance. 

We  can  define  parallel  efficiency  for  a  given  number  of  processors  P  as  the  sequen¬ 
tial  time  divided  by  the  product  of  the  execution  time  on  P  processors  times  P.  In 
Table  3  we  depict  under  the  heading  of  single  sweep  efficiency,  the  parallel  efficien¬ 
cies  we  would  obtain  were  we  required  to  preprocess  the  kernel  each  time  we  carried 
out  calculations.  In  reality,  preprocessing  time  can  be  amortized  over  multiple  mesh 
sweeps.  If  we  neglect  the  time  required  to  preprocess  the  problem  in  computing 
parallel  efficiencies,  we  obtain  the  second  set  of  parallel  efficiency  measurements  pre¬ 
sented  in  Table  3.  The  amortized  parallel  efficiencies  we  obtain  range  from  99.7  for 
2  processors  to  44.4  for  64  processors. 

The  processor  architecture  of  the  Intel-2/860  includes  a  8K  byte  data  cache. 
We  can  anticipate  that  for  a  fixed  sized  problem,  the  rate  of  computation  on  each 
node  will  tend  to  increase  with  concurrency.  We  quantified  the  performance  effects 


Table  2:  Timings  for  Roe’s  Approximate  Riemann  Solver  of  Unstructured  Mesh  (in 


microsecs) 

Number  of 

Mflops 

Total 

Form  off-processor 

Total  Kernel 

Communication 

Processors 

Preprocessing 

Lists,  Schedule 

Time 

in  Kernel 

Time(ms) 

Time  (ms) 

Time(ms) 

Time(ms) 

1 

3.1 

- 

- 

1845 

- 

2 

6.1 

198 

17 

925 

17 

4 

11.8 

105 

13 

482 

27 

8 

22.0 

56 

9 

258 

31 

16 

38.9 

32 

8 

146 

36 

32 

63.1 

25 

12 

90 

41 

64 

87.3 

23 

16 

65 

40 

Table  3:  Parallel  Efficiencies  for  Roe’s  Approximate  Riemann  Solver  of  Unstructured 
Mesh  _ 


Number  of 

Single  Sweep 

Amortized 

Single  Processor 

Processors 

Efficiency 

Efficiency 

Speed  (Mflops) 

1 

100.0 

100.0 

3.08 

2 

82.1 

99.7 

3.17 

4 

78.6 

95.6 

3.17 

8 

73.4 

89.3 

3.18 

16 

64.8 

80.0 

3.20 

32 

50.1 

64.1 

3.23 

64 

32.8 

44.4 

3.24 

of  the  data  cache  by  employing  the  sequential  program  to  sweep  over  the  edges 
that  we  assigned  to  processor  0  when  we  partitioned  the  problem  between  various 
numbers  of  processors.  In  Table  3,  we  depict  the  results  we  obtained.  The  rate  of 
computation  was  3.08  Mflops  when  we  calculated  the  flux  for  the  entire  problem.  The 
computational  rate  increased  monotonically  as  the  size  of  the  sub-problem  assigned 
to  processor  0  decreased,  reaching  3.24  Mflops  in  the  64  processor  case. 

4.3  Computational  Results  for  Domain  Decomposition  Al¬ 
gorithm 

We  present  computational  results  for  the  parallelized  domain  decomposition  algo¬ 
rithm.  This  domain  decomposition  algorithm  decides  when  to  refine  the  coarse  mesh 
using  an  error  estimator  based  on  the  first  and  second  derivatives  of  the  solution  [4], 
This  program  was  written  in  Fortran. 

Table  4  depicts  the  total  computation  time  in  seconds  required  to  run  the  ex¬ 
ample  problem  with  the  tolerance  ( TOL )  for  the  second  derivative  set  to  21.  The 
computations  were  carried  out  for  a  total  of  440  coarse-grid  time  steps.  Due  to 
memory  constraints,  we  we  unable  to  run  this  problem  on  fewer  than  eight  proces¬ 
sors.  An  conservative  lower  bound  estimate  of  the  sequential  time  tha:  .id  be 


required  to  solve  this  problem  was  obtained  by  generating  S,  the  sum  of  the  time 
all  processors  spent  in  sweeps  over  the  fine  mesh.  S  includes  no  communication  or 
primitive  calls  and  the  code  executed  has  the  same  number  of  operations  as  would  a 
sequential  sweep  over  a  fine  mesh.  The  8  processor  timing  took  3.2  times  as  long  as 
the  4  processor  timing;  the  ratio  between  the  32  processor  timing  and  S  was  22.7. 

We  also  measure  the  total  time  required  by  all  Level  1  schedule  procedure  calls. 
The  scheduling  took  very  little  time;  the  overhead  for  scheduling  ranged  from  0.5  % 
of  the  total  time  (on  8  processors  )  to  4.0  %  of  the  total  time  (on  32  processors).  This 
increase  in  the  cost  of  the  Scheduler  with  increasing  numbers  of  processors  can  be 
explained  by  the  Scheduler’s  global  communications  phase  discussed  in  Section  2.1.1. 
Finally  we  measured  the  time  required  for  carrying  out  gather/scatter  procedure 
calls.  The  time  required  for  the  gather/scatter  procedure  calls  ranged  from  8  %  to 
11  %  of  the  execution  time. 

In  Table  5  we  depict  the  results  obtained  from  running  the  example  problem  on 
32  processors  with  a  range  of  second  derivative  tolerance  (TOL)  values;  the  amount  of 
refinement  increases  as  TOL  decreased.  These  problems  were  all  continued  for  a  total 
of  440  coarse-grid  time  steps.  In  Table  5  we  see  that  for  problems  in  which  there  was 
less  refinement,  the  Level  1  scheduler  required  a  larger  proportion  of  the  total  time. 
The  ratio  of  computation  to  communication  time  did  not  change  significantly  with 
the  amount  of  refinement. 

The  refined  mesh  blocks  were  distributed  among  processors  in  a  round  robin 
fashion.  We  consequently  expect  the  ratio  of  computations  to  data  elements  com¬ 
municated  to  remain  roughly  constant.  As  the  number  of  refined  blocks  increases, 
we  expect  that  each  processor  will  have  to  communicate  with  increasing  numbers 
of  other  processors.  A  more  sophisticated  strategy  for  assigning  refined  blocks  to 
processors  would  be  likely  to  result  in  lower  communication  times  both  by  reducing 
the  volume  and  increasing  the  locality  of  communications. 

The  Level  1  Scheduler  required  only  4  %  of  the  total  time  when  TOL  was  equal 
to  21  but  the  Level  1  scheduler  required  10  %  of  the  total  time  when  TOL  was  equal 
to  27.  As  stated  above,  the  scheduler  has  a  global  communications  phase  whose 
cost  does  not  depend  on  the  amount  of  data  to  be  communicated  by  each  processor. 
The  Estimated  Optimal  Computation  time  depicted  in  Table  5  is  S  divided  by  the 
number  of  processors.  This  gives  a  rough  estimate  of  the  time  that  would  be  required 
to  solve  this  problem  were  we  to  have  attained  linear  speedup. 
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Table  5:  Adaptive  Solver-  Varying  Tolerance  (32  processors) 


Tolerence 

Total 

Time 

Est.  Optimal 
Comp  Time 
(seconds) 

Scheduling 

Time 

(seconds) 

Communication 

Time 

(seconds) 

21 

248 

176 

10 

27 

23 

186 

130 

9 

19 

25 

130 

91 

7 

15 

27 

92 

57 

6 

9 

5  Conclusion 

The  ideas  incorporated  in  PARTI  are  specifically  aimed  at  computations  on  dis¬ 
tributed  memory  machines  in  which  the  structure  of  the  computation  depends  on 
the  input  data.  Run-time  support  must  be  incorporated  as  part  of  the  distributed 
implementation  of  such  computations. 

The  PARTI  primitives  allow  the  user  to  describe  how  a  data  array  is  mapped 
into  the  distributed  memory  of  the  machine.  This  is  accomplished  by  specifying  the 
mapping  of  the  data  to  a  virtual  processor  array,  then  describing  the  relationship  be¬ 
tween  the  virtual  processor  array  and  the  original  processor  array.  These  primitives 
carry  out  scheduling  operations  that  make  it  possible  to  efficiently  carry  out  gather 
and  scatter  operations  on  distributed  arrays  using  global  indices.  To  illustrate  the 
performance  tradeoffs  encountered  in  carrying  out  these  optimizations,  we  presented 
benchmark  results  from  an  unstructured  mesh  kernel,  and  an  adaptive  partial  dif¬ 
ferential  equation  solver.  Our  results  suggest  that  these  primitives  carry  out  needed 
optimizations  at  a  relatively  low  cost. 
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